Recognition of spontaneous conversational speech using long short-term memory phoneme predictions
نویسندگان
چکیده
We present a novel continuous speech recognition framework designed to unite the principles of triphone and Long ShortTerm Memory (LSTM) modeling. The LSTM principle allows a recurrent neural network to store and to retrieve information over long time periods, which was shown to be well-suited for the modeling of co-articulation effects in human speech. Our system uses a bidirectional LSTM network to generate a phoneme prediction feature that is observed by a triphone-based large-vocabulary continuous speech recognition (LVCSR) decoder, together with conventional MFCC features. We evaluate both, phoneme prediction error rates of various network architectures and the word recognition performance of our Tandem approach using the COSINE database a large corpus of conversational and noisy speech, and show that incorporating LSTM phoneme predictions in to an LVCSR system leads to significantly higher word accuracies.
منابع مشابه
Robust Multi-stream Keyword and Non-linguistic Vocalization Detection for Computationally Intelligent Virtual Agents
Systems for keyword and non-linguistic vocalization detection in conversational agent applications need to be robust with respect to background noise and different speaking styles. Focussing on the Sensitive Artificial Listener (SAL) scenario which involves spontaneous, emotionally colored speech, this paper proposes a multi-stream model that applies the principle of Long Short-Term Memory to g...
متن کاملVerbal-Auditory Skills in 5-year-Old Children of Semnan/Iran in 2006
Introduction: This research was planned to determine some verbal-auditory skills (verbal-auditory short memory and phonological awareness) that have the closest relationship with speech and language development in 5-year-old children. Method: In this descriptive cross-sectional study, 400 children of pre-school classes affiliated to Education and Welfare organizations in Semnan city were select...
متن کاملSpeech Emotion Recognition Using Scalogram Based Deep Structure
Speech Emotion Recognition (SER) is an important part of speech-based Human-Computer Interface (HCI) applications. Previous SER methods rely on the extraction of features and training an appropriate classifier. However, most of those features can be affected by emotionally irrelevant factors such as gender, speaking styles and environment. Here, an SER method has been proposed based on a concat...
متن کاملBidirectional LSTM Networks for Improved Phoneme Classification and Recognition
In this paper, we carry out two experiments on the TIMIT speech corpus with bidirectional and unidirectional Long Short Term Memory (LSTM) networks. In the first experiment (framewise phoneme classification) we find that bidirectional LSTM outperforms both unidirectional LSTM and conventional Recurrent Neural Networks (RNNs). In the second (phoneme recognition) we find that a hybrid BLSTM-HMM s...
متن کاملCombining Bottleneck-BLSTM and Semi-Supervised Sparse NMF for Recognition of Conversational Speech in Highly Instationary Noise
We address the speaker independent automatic recognition of spontaneous speech in highly variable noise by applying semisupervised sparse non-negative matrix factorization (NMF) for speech enhancement coupled with our recently proposed frontend utilizing bottleneck (BN) features generated by a bidirectional Long Short-Term Memory (BLSTM) recurrent neural network. In our evaluation, we unite the...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2010